Detecting Co-Derivative Documents in Large Text Collections

Authors

Jan Pomikálek

Pavel Rychlý

Abstract

We have analyzed the SPEX algorithm by Bernstein and Zobel [1] for detecting co-derivative documents using duplicate n-grams. Although we fully agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. We propose a solution to this problem using an external sort with suffix array in-memory sorting and temporary file ...
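The general idea of finding duplicate n-grams with an external sort can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `chunk_size` parameter, and the use of word n-grams are assumptions made for illustration, "duplicate" is taken here to mean an n-gram that occurs more than once across the collection, and Python's built-in `sorted` stands in for the suffix array in-memory sort mentioned in the abstract.

```python
import heapq
import os
import tempfile
from itertools import groupby


def word_ngrams(text, n):
    """Yield each word n-gram of `text` as a space-joined string."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def _write_run(sorted_ngrams, tmp_dir):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for gram in sorted_ngrams:
            f.write(gram + "\n")
    return path


def _read_run(path):
    """Stream the n-grams of one sorted run back from disk."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def duplicate_ngrams(documents, n=3, chunk_size=100_000):
    """Return the set of n-grams that occur more than once in `documents`.

    Hypothetical helper: n-grams are sorted in memory in chunks of at most
    `chunk_size` items (plain sort here, not a suffix array), each chunk is
    spilled to a temporary file, and the sorted runs are merged so that
    equal n-grams become adjacent.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        runs, chunk = [], []
        for doc in documents:
            for gram in word_ngrams(doc, n):
                chunk.append(gram)
                if len(chunk) >= chunk_size:
                    runs.append(_write_run(sorted(chunk), tmp_dir))
                    chunk = []
        if chunk:
            runs.append(_write_run(sorted(chunk), tmp_dir))

        # k-way merge of the sorted runs; duplicates show up as groups
        # containing at least two items.
        merged = heapq.merge(*(_read_run(path) for path in runs))
        duplicates = set()
        for gram, group in groupby(merged):
            next(group)  # first occurrence
            if next(group, None) is not None:  # a second occurrence exists
                duplicates.add(gram)
        return duplicates
```

In this sketch only one chunk of n-grams is held in memory at a time during the sorting phase, and the merge streams the runs from disk. The resulting set of duplicates is still kept in memory, which a larger-scale version would presumably write out to a file instead.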

Similar resources

Accurate discovery of co-derivative documents via duplicate text detection

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences,...

Methods for Identifying Versioned and Plagiarised Documents

The widespread use of online publishing of text promotes storage of multiple versions of documents and mirroring of documents in multiple locations, and greatly simplifies the task of plagiarising the work of others. We evaluate two families of methods for searching a collection to find documents that are co-derivative, that is, are versions or plagiarisms of each other. The first, the ranking ...

A Scalable System for Identifying Co-derivative Documents

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, ...

Passage Selection To Improve Question Answering

Open-Domain Question Answering (QA) systems perform the task of detecting text fragments in a collection of documents that contain the response to users' queries. These systems use high-complexity tools that limit their applicability to the treatment of small amounts of text. Consequently, when working on large document collections, QA systems apply Information Retrieval (IR) techniques to redu...

Detecting Short Passages of Similar Text in Large Document Collections

This paper presents a statistical method for fingerprinting text. In a large collection of independently written documents each text is associated with a fingerprint which should be different from all the others. If fingerprints are too close then it is suspected that passages of copied or similar text occur in two documents. Our method exploits the characteristic distribution of word trigrams,...

Publication date: 2008
